[Bug] Fix skip() API which may skip unexpected bytes and cause inconsistent data #40

frankliee merged 3 commits into apache:master
Conversation
Could you add a UT for this case?
Codecov Report

Coverage diff (master vs. #40):

|            | master | #40    | +/-    |
|------------|--------|--------|--------|
| Coverage   | 56.81% | 56.83% | +0.01% |
| Complexity | 1203   | 1207   | +4     |
| Files      | 152    | 152    |        |
| Lines      | 8401   | 8437   | +36    |
| Branches   | 813    | 819    | +6     |
| Hits       | 4773   | 4795   | +22    |
| Misses     | 3369   | 3380   | +11    |
| Partials   | 259    | 262    | +3     |

Continue to review the full report at Codecov.
The fix targets a potential problem with this API, and the issue can't be reproduced.

Actually, I'm not quite sure whether skip() causes the problem, but with this fix we can check it via the new log.
Does the Hdfs reader need the same fix?
@@ -41,7 +41,22 @@ public LocalFileReader(String path) throws Exception {

  public byte[] read(long offset, int length) {
    try {
`at` -> `targetSkip` or `maxSkip`?
Hdfs has its own API.
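The review comments above discuss how the reader should handle a partial skip. As a rough illustration of that defensive pattern (a minimal sketch, not the actual patch; the method and variable names such as `skipFully` and `targetSkip` are made up for the example), the reader can loop until the requested number of bytes has actually been skipped and fail loudly otherwise:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch only: InputStream.skip() may legally skip fewer bytes than requested,
// so keep calling it until the target offset is reached or no progress is made.
final class SkipUtil {
  static void skipFully(InputStream in, long targetSkip) throws IOException {
    long remaining = targetSkip;
    while (remaining > 0) {
      long skipped = in.skip(remaining);
      if (skipped <= 0) {
        // No progress: surface the problem instead of reading from the wrong
        // offset, which would only show up later as a CRC mismatch on the block.
        throw new IOException("Expected to skip " + targetSkip
            + " bytes but " + remaining + " bytes remain unskipped");
      }
      remaining -= skipped;
    }
  }
}
```

With a helper like this, a reader such as `LocalFileReader.read` could verify that the stream really is at the expected offset before reading the block, so a short skip surfaces in the log instead of as downstream data corruption.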
LGTM. I tried to reproduce this problem, but I failed.
@colinmjj Do we need to backport this patch to branch 0.5? |
sure, I'll create a new PR for the backport |
[Bug] Fix skip() api maybe skip unexpected bytes which makes inconsistent data (apache#40)

### What changes were proposed in this pull request?
Fix a bug when calling `inputstream.skip()`, which may return an unexpected result.

### Why are the changes needed?
We got exception messages like the following, and they may be caused by unexpected data from `Local` storage:
```
com.tencent.rss.common.exception.RssException: Unexpected crc value for blockId[9992363390829154], expected:2562548848, actual:2244862586
	at com.tencent.rss.client.impl.ShuffleReadClientImpl.readShuffleBlockData(ShuffleReadClientImpl.java:184)
	at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:99)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With current UTs

[Bug] Fix skip() api maybe skip unexpected bytes which makes inconsistent data (#40) (#52)
Yes, this problem still exists after this patch, and there is no exception log on the server side. @jerqi @colinmjj
…1632)

### What changes were proposed in this pull request?
In our cluster, deleting the pod is denied by a webhook even though all applications have been deleted for a long time. When I curl http://host:ip/metrics/server, I find app_num_with_node is 1. The problem is that some applications are leaked. I see many duplicated logs: `[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[appattempt_xxx_xx_xx] according to rss.server.app.expired.withoutHeartbeat`. When I jstack the server many times, clearResourceThread is stuck forever; here is the call stack:
```
"clearResourceThread" #40 daemon prio=5 os_prio=0 cpu=3767.63ms elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition [0x00007f24f7b33000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.22/Native Method)
	- parking to wait for <0x00007f28d5e29f20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.22/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.22/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.22/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.22/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.22/ReentrantReadWriteLock.java:959)
	at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
	at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
	at org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown Source)
	at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
```
Apparently there is a lock that is not being released. Looking at the code, it is easy to see that the read lock in flushBuffer is not released correctly. The log `ShuffleBufferManager.flushBuffer - Shuffle[3066071] for app[appattempt_xxx] has already been removed, no need to flush the buffer` proves it.

Fix: #1631

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
No test; obvious mistake.
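For readers following that referenced fix: the failure mode is a ReentrantReadWriteLock read lock that is acquired but never released on an early-return path, so a later write-lock acquisition blocks forever. A minimal sketch of the standard try/finally pattern that avoids this (class, method, and field names here are illustrative, not the real ShuffleBufferManager code):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the lock-release pattern: every path out of the method, including
// the early return, must unlock, otherwise writeLock().lock() blocks forever.
final class BufferFlushExample {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void flushBuffer(boolean shuffleAlreadyRemoved) {
    lock.readLock().lock();
    try {
      if (shuffleAlreadyRemoved) {
        // Early return is safe because the finally block releases the lock.
        return;
      }
      // ... flush the buffer while holding the read lock ...
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

Putting the unlock in a finally block guarantees that even the "already removed, no need to flush" exit releases the read lock, so the cleanup thread's write lock can still be acquired.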
### What changes were proposed in this pull request?
Fix a bug when calling `inputstream.skip()`, which may return an unexpected result.

### Why are the changes needed?
We got exception messages like the stack trace shown above, and they may be caused by unexpected data from `Local` storage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With current UTs
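As a concrete illustration of the failure mode described above: a single `InputStream.skip()` call is allowed to skip fewer bytes than requested. This tiny standalone example (not part of the PR) shows it with a ByteArrayInputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Demonstrates the InputStream.skip() contract: the return value must be
// checked, because a single call may skip fewer bytes than requested.
public final class SkipContractDemo {
  public static void main(String[] args) throws IOException {
    InputStream in = new ByteArrayInputStream(new byte[10]);
    long skipped = in.skip(100);  // ask to skip past the end of the stream
    // Prints: requested=100, actually skipped=10
    System.out.println("requested=100, actually skipped=" + skipped);
  }
}
```

Here skip(100) returns 10, so any reader that trusts one skip() call to land on the requested offset can end up reading the wrong bytes, which is exactly the kind of silent corruption that later surfaces as a CRC mismatch.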